Statistical Inference After Model Selection

Berk, Brown and Zhao (2010)

Abraham Iñiguez and Marwin Carmo

Overview of the presentation

Background

\[\begin{equation} y_i = \beta_0 + \beta_1X_i + \varepsilon_i \end{equation}\]
  • Classical regression analysis treats predictors as fixed (non-random).

  • Under this assumption, least squares yields unbiased coefficient estimates.

Motivations

  • Berk and colleagues (2010) argue that model selection is ubiquitous in criminology.

  • When the correct model is unknown…

    • A set of candidate models is defined
    • The data are examined to choose the “best” model
    • Parameters are estimated
    • Statistical inference is performed
  • But what happens when the same data are used for both model selection and inference?

Why is model selection an issue?

  • Sampling distributions become distorted

  • Estimates are biased (e.g., Beta coefficients, standard errors)

  • Overconfidence in results (inflated Type I error)

Why is model selection an issue?

Specific Mechanisms

  • Direct Censoring

  • Indirect Censoring

  • Alterations in dispersion of regression parameter estimates

Simulations of Model-Selection

  • Forward stepwise regression using AIC applied to 10,000 samples of \(n\) = 200.

  • At each step, the term is added that leads to the model with the smallest AIC. The procedure stops when no remaining regressor improves the AIC.

\[\begin{equation} y_i = \beta_0 + \beta_1w_i + \beta_2x_i + \beta_3z_i + \varepsilon_i \end{equation}\]
  • The predictors follow a multivariate normal distribution:

\[ \begin{bmatrix} w \\ x \\ z \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 5.0 & 4.0 & 5.0 \\ 4.0 & 6.0 & 5.0 \\ 5.0 & 5.0 & 7.0 \end{bmatrix} \right) \]

  • The fixed effects coefficients are \(\beta_0\) = 3.0, \(\beta_1\) = 0.0, \(\beta_2\) = 1.0, and \(\beta_3\) = 2.0.

  • The error term follows a normal distribution, \(\varepsilon \sim \mathcal{N}(0, 10^2)\), i.e., a standard deviation of 10.

Simulations of Model-Selection

reps <- 10000                      # number of simulated samples
p <- 3                             # number of predictors
Sigma <- matrix(c(5, 4, 5,
                  4, 6, 5,
                  5, 5, 7), p, p)  # predictor covariance matrix
n <- 200                           # sample size per draw
betas <- c(3, 0, 1, 2)             # intercept, w, x, z

rsq <- numeric(reps)
coefs <- cover <- matrix(NA, nrow = reps, ncol = p)
colnames(coefs) <- colnames(cover) <- c("w", "x", "z")

Simulations of Model-Selection


for (i in seq(reps)) {
  X <- MASS::mvrnorm(n = n, mu = rep(0, p), Sigma = Sigma)
  y <- as.numeric(cbind(1, X) %*% betas + rnorm(n, 0, 10))
  Xy <- as.data.frame(cbind(X, y))
  colnames(Xy) <- c("w", "x", "z", "y")
  fit <- lm(y ~ ., data = Xy)
  sel <- step(fit, k = 2, trace = FALSE)  # stepwise selection by AIC
  s <- summary(sel)
  tvals <- s$coefficients[, 3][-1]        # post-selection t-values (intercept dropped)
  coefs[i, names(tvals)] <- tvals         # NA when a predictor was not selected
  rsq[i] <- s$r.squared
}
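The loop above does not record which model was selected in each draw. A minimal sketch of how the selection frequencies in the next table can be tallied (assuming a hypothetical character vector `selected` holding one model label per replicate; the values below are toy data, not the real 10,000 draws):

```r
# Tally how often each candidate model is chosen, as percentages.
# `selected` is a hypothetical per-replicate model label, e.g. "XZ"
# when only x and z survive selection.
selected <- c("XZ", "WXZ", "XZ", "Z", "XZ")  # toy data for illustration
round(100 * prop.table(table(selected)), 2)
```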

Distribution of models selected in 10,000 draws

| Model (%)   | None | W    | X      | Z    | WX   | WZ   | XZ    | WXZ   |
|-------------|------|------|--------|------|------|------|-------|-------|
| Berk et al. | 0    | 0    | 0.0001 | 17.4 | 1.0  | 4.9  | 65.7  | 10.8  |
| Replication | 0    | 0.02 | 0.03   | 17.8 | 1.16 | 4.35 | 65.23 | 11.41 |

Simulation results

  • The \(R^2\)s varied over the simulations between about .3 and .4.

  • For \(X\), the post-model-selection t-value distribution has a larger mean (2.6 vs. 2.2) and a smaller standard deviation (0.79 vs. 1.0).

  • For \(Z\), both the mean and the standard deviation are biased substantially upward: from 4.9 to 5.5 for the mean and from 1.0 to 2.3 for the standard deviation.

Statistics (our results)

| Predictor | Coverage | Estimate | t-value | Inclusion (%) | Bias   | MSE   |
|-----------|----------|----------|---------|---------------|--------|-------|
| W         | 0.719    | 0.117    | 0.264   | 16.94         | -1.335 | 4.470 |
| X         | 0.968    | 1.179    | 2.595   | 77.83         | -0.325 | 1.492 |
| Z         | 0.963    | 2.064    | 5.400   | 98.79         | 0.562  | 1.845 |

\[ \text{Bias} = \frac{1}{S}\sum^S_{s=1}{\left( \hat{\beta}_{p,s} - \beta_p\right)}, \quad \text{MSE} = \frac{1}{S}\sum^S_{s=1}{\left( \hat{\beta}_{p,s} - \beta_p\right)^2} \]
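The two formulas translate directly into code. A minimal sketch (the vector `est` and the helper `bias_mse` are hypothetical; replicates where the predictor was dropped are marked `NA` and excluded here, which is itself a design choice):

```r
# Bias and MSE of per-replicate estimates of a single coefficient beta_p.
# `est` is a hypothetical vector of S estimates; NA marks replicates
# where the predictor was not selected.
bias_mse <- function(est, beta_true) {
  d <- est[!is.na(est)] - beta_true
  c(bias = mean(d), mse = mean(d^2))
}
bias_mse(c(1.1, 0.9, NA, 1.3), beta_true = 1)
```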

t-values for \(\beta_2\)

  • Red curve/ solid line = Conditional on preferred model being known

  • Blue curve/ broken line = Conditional on predictor being included in a model

t-values for \(\beta_3\)

  • Red curve/ solid line = Conditional on preferred model being known

  • Blue curve/ broken line = Conditional on predictor being included in a model

False power

  • For the preferred model, power to reject \(H_0: \beta_2=0\) with \(\alpha =0.05\) is approximately 60%.

  • After model selection, that probability is about 76%.

  • Bias due to model selection artificially inflates apparent power by about 27% in relative terms (76/60 ≈ 1.27).
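A minimal sketch of how these two power figures can be computed, using the approximate means and standard deviations of the t-value distributions reported earlier (2.2 and 1.0 for the known model, 2.6 and 0.79 after selection); the normal approximation is an assumption for illustration:

```r
# "False power": rejection rate for H0: beta_2 = 0 at alpha = .05,
# comparing t-values under the known model vs. after selection.
set.seed(42)
t_known <- rnorm(10000, mean = 2.2, sd = 1.0)   # approx. known-model t-values
t_post  <- rnorm(10000, mean = 2.6, sd = 0.79)  # approx. post-selection t-values
mean(abs(t_known) > 1.96)  # close to the ~60% power for the preferred model
mean(abs(t_post)  > 1.96)  # close to the ~76% apparent power after selection
```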

Selecting the preferred model (\(X\) and \(Z\) included) does not guarantee any of the desirable properties of the regression coefficient estimates.

Summary

  • The post-model-selection sampling distribution can take on a wide variety of shapes, even when the researcher happens to have selected the model accurately representing how the data were generated.

  • Inference from these models can lead to biased regression parameter estimates and misleading statistical tests and confidence intervals.

  • The particular selection procedure used does not materially matter.

Potential solutions

  • Split sample in training and test samples (😦)

  • Collect two random samples (😨)

  • Derive a theoretically based appropriate model (😰)

  • Differentiate between confirmatory and exploratory analysis (🤯)

  • Should all else fail, forego formal statistical inference altogether (☠️)
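The first remedy can be sketched directly with the simulation's own setup: select on one half of the sample and run inference on the other half, so the data used for testing never inform the model choice. A minimal sketch (the even split and seed are illustrative assumptions):

```r
# Split-sample remedy: model selection on a training half,
# honest inference on a held-out test half.
set.seed(1)
n <- 200
Sigma <- matrix(c(5, 4, 5, 4, 6, 5, 5, 5, 7), 3, 3)
X <- MASS::mvrnorm(n, rep(0, 3), Sigma)
Xy <- as.data.frame(cbind(X, 3 + X %*% c(0, 1, 2) + rnorm(n, 0, 10)))
colnames(Xy) <- c("w", "x", "z", "y")
train <- Xy[1:100, ]
test  <- Xy[101:200, ]
sel <- step(lm(y ~ ., data = train), trace = FALSE)  # selection: training half only
summary(lm(formula(sel), data = test))               # inference: untouched test half
```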

Discussion

  • What did the authors do particularly well?
    • Simulation conditions were well specified
    • Good framing of the problem and laying out potential solutions
  • Is there enough detail to replicate their simulation?
    • As we have shown, there is!
  • What is lacking?
    • Variations to show when post-model-selection inference is really bad.
    • Heatmaps!

Standardized measures of Bias, Coverage, Mean Square Error (MSE), and \(R^2\), across simulations for each pair of number-of-predictors and correlation values.

Try my ShinyApp!